These notes are derived primarily from Dobson and Barnett (2018) (mostly chapters 1-5).
Some material was also taken from McLachlan and Krishnan (2007) and Casella and Berger (2002).
Configuring R
Functions from these packages will be used throughout this document:
[R code]
library(conflicted)  # check for conflicting function definitions
# library(printr)    # inserts help-file output into markdown output
library(rmarkdown)   # convert R Markdown documents into a variety of formats
library(pander)      # format tables for markdown
library(ggplot2)     # graphics
library(ggfortify)   # help with graphics
library(dplyr)       # manipulate data
library(tibble)      # `tibble`s extend `data.frame`s
library(magrittr)    # `%>%` and other additional piping tools
library(haven)       # import Stata files
library(knitr)       # format R output for markdown
library(tidyr)       # tools to help create tidy data
library(plotly)      # interactive graphics
library(dobson)      # datasets from Dobson and Barnett 2018
library(parameters)  # format model output tables for markdown
library(latex2exp)   # use LaTeX in R code (for figures and tables)
library(fs)          # filesystem path manipulations
library(survival)    # survival analysis
library(survminer)   # survival analysis graphics
library(KMsurv)      # datasets from Klein and Moeschberger
library(webshot2)    # convert interactive content to static for pdf
library(forcats)     # functions for categorical variables ("factors")
library(stringr)     # functions for dealing with strings
library(lubridate)   # functions for dealing with dates and times
Here are some R settings I use in this document:
[R code]
rm(list = ls())  # delete any data that's already loaded into R

conflicts_prefer(dplyr::filter)  # use the `filter()` function from dplyr by default

ggplot2::theme_set(
  ggplot2::theme_bw() +
    # ggplot2::labs(col = "") +
    ggplot2::theme(
      legend.position = "bottom",
      text = ggplot2::element_text(size = 12, family = "serif")
    )
)

knitr::opts_chunk$set(message = FALSE)
options(digits = 6)

pander::panderOptions("big.mark", ",")
pander::panderOptions("table.emphasize.rownames", FALSE)
pander::panderOptions("table.split.table", Inf)

legend_text_size <- 9
run_graphs <- TRUE
1.1 Overview of maximum likelihood estimation
The likelihood function
Definition 1 (Likelihood of a single observation) Let \(X\) be a random variable and let \(x\) be \(X\)’s observed data value. Let \(\text{p}_{\Theta}(X=x)\) be a probability model for the distribution of \(X\), with parameter vector \(\Theta\).
Then the likelihood of parameter value \(\theta\), for model \(\text{p}_{\Theta}(X=x)\) and data \(X = x\), is simply the probability of the event \(X=x\) given \(\Theta= \theta\):

\[\mathscr{L}(\theta) \stackrel{\text{def}}{=}\text{P}(X = x | \Theta = \theta)\]
Definition 2 (Likelihood of a dataset) Let \(\tilde{x} \stackrel{\text{def}}{=}x_1, \ldots, x_n\) be a dataset with corresponding random variable \(\tilde{X}\). Let \(\text{p}_{\Theta}(\tilde{X})\) be a probability model for the distribution of \(\tilde{X}\) with unknown parameter vector \(\Theta\).
Then the likelihood of parameter value \(\theta\), for model \(\text{p}_{\Theta}(\tilde{X})\) and data \(\tilde{X}= \tilde{x}\), is the joint probability of \(\tilde{X}= \tilde{x}\) given \(\Theta= \theta\):

\[\mathscr{L}(\theta) \stackrel{\text{def}}{=}\text{P}(\tilde{X} = \tilde{x} | \Theta = \theta)\]
The likelihood is written variously as \(\mathscr{L}(\theta)\), \(\mathscr{L}(\theta; \tilde{x})\), or \(\mathscr{L}(\tilde{x} | \theta)\); all of these notations mean the same thing. The parameter vector \(\theta\) is often listed first or solely, either to emphasize that we are interested in how this function varies with the parameters, given the data, or possibly to make the likelihood resemble the Bayesian posterior probability \(\text{p}(\theta | \tilde{x})\), hinting at the fact that if the prior probability \(\text{p}(\theta)\) is uniform over some finite parameter space, the posterior probability is proportional to the likelihood:

\[\text{p}(\theta | \tilde{x}) \propto \mathscr{L}(\theta)\]
Proof. \[
\begin{aligned}
\mathscr{L}(\tilde{x}|\theta)
&\stackrel{\text{def}}{=}\text{p}(X_1 = x_1, …,X_n =x_n|\theta)
\\&= \prod_{i=1}^n \text{p}(X_i=x_i|\theta)
\end{aligned}
\] The second equality is by the definition of statistical independence.
Definition 3 (Likelihood components) Given an \(\text{iid}\) dataset \(\tilde{x}\), the likelihood component or likelihood factor of observation \(X_i=x_i\) is the marginal likelihood of \(X_i=x_i\):
\[\mathscr{L}_i(\theta) = \text{P}(X_i=x_i)\]
Theorem 2 For \(\text{iid}\) data \(\tilde{x}\stackrel{\text{def}}{=}x_1, \ldots, x_n\), the likelihood of the dataset is equal to the product of the observation-specific likelihood factors:

\[\mathscr{L}(\theta) = \prod_{i=1}^n \mathscr{L}_i(\theta)\]
Definition 4 (Maximum likelihood estimate) The maximum likelihood estimate of a parameter vector \(\Theta\), denoted \(\hat\theta_{\text{ML}}\), is the value of \(\Theta\) that maximizes the likelihood:

\[\hat\theta_{\text{ML}} \stackrel{\text{def}}{=}\arg \max_{\theta} \mathscr{L}(\theta)\]
Recall from calculus: the maxima of a continuous function \(f(x)\) over a range of input values \(\mathcal{R}(x)\) can be found either:
at the edges of the range of input values, OR:
where the function is flat (i.e. where the gradient function \(f'(x) = 0\)) AND the second derivative is negative definite (\(f''(x) < 0\)).
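A quick numerical illustration of these conditions (my own sketch, not from the text): for \(f(x) = -(x-2)^2\) on \([0, 5]\), the gradient \(f'(x) = -2(x-2)\) is zero at \(x = 2\), and \(f''(x) = -2 < 0\), so the maximum sits at the interior point \(x = 2\) rather than at an edge.

```r
# Verify numerically with R's built-in univariate optimizer:
f <- function(x) -(x - 2)^2
opt <- optimize(f, interval = c(0, 5), maximum = TRUE)
opt$maximum    # approximately 2
opt$objective  # approximately 0
```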
Directly maximizing the likelihood function for independent data
To find the maximizer(s) of the likelihood function, we need to solve \(\mathscr{L}'(\theta) = 0\) for \(\theta\). However, even for mutually independent data, we quickly run into a problem:
The derivative of the likelihood of independent data is the derivative of a product. To evaluate this derivative, we would have to perform a massive application of the product rule.
The log-likelihood function
Definition 5 (Log-likelihood) The log-likelihood of parameter value \(\theta\), for model \(\text{p}_{\Theta}(\tilde{X})\) and data \(\tilde{X}= \tilde{x}\), is the natural logarithm of the likelihood:

\[\ell(\theta) \stackrel{\text{def}}{=}\log{\mathscr{L}(\theta)}\]

The first derivative of the log-likelihood, \(\ell'(\theta)\), is important enough to have its own name: the score function.
Definition 6 (Score function) The score function of a statistical model \(\text{p}(\tilde{X}=\tilde{x})\) is the gradient (i.e., first derivative) of the log-likelihood of that model:

\[\ell'(\theta) \stackrel{\text{def}}{=}\frac{\partial}{\partial \theta} \ell(\theta)\]
Exercise 3 Derive the score function for a single Bernoulli random variable \(X\). In other words, differentiate the marginal log-likelihood of a single Bernoulli random variable \(X\) with respect to the event probability parameter, \(\pi\). Simplify as much as possible.
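One possible derivation (a sketch of my own, which readers should verify): for \(X \sim \text{Bernoulli}(\pi)\), the marginal likelihood, log-likelihood, and score are

```latex
\begin{aligned}
\mathscr{L}(\pi; x) &= \pi^x (1-\pi)^{1-x}
\\ \ell(\pi; x) &= x \log{\pi} + (1 - x) \log{(1-\pi)}
\\ \ell'(\pi; x) &= \frac{x}{\pi} - \frac{1-x}{1-\pi} = \frac{x - \pi}{\pi(1-\pi)}
\end{aligned}
```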
Definition 7 (Observed information matrix) The observed information matrix, denoted \(I\), is defined as the negative of the Hessian of the log-likelihood:

\[I(\theta; \tilde{x}) \stackrel{\text{def}}{=}-\ell''(\theta; \tilde{x})\]
Definition 8 (Expected information/Fisher information) The expected information matrix, also known as the Fisher information matrix, is denoted \(\mathcal{I}\) and is defined as the expected value of the observed information matrix:

\[\mathcal{I}(\theta) \stackrel{\text{def}}{=}\text{E}\left[I(\theta; \tilde{X})\right]\]
The variance of \(\ell'(x,\theta)\), \(\text{Cov}\left\{ \ell'(x,\theta) \right\}\), is also very important; we call it the “expected information matrix”, “Fisher information matrix”, or just “information matrix”, and we represent it using the symbol \(\mathcal{I}(\theta)\) (script capital “I”: `ℐ` in Unicode, `\mathcal{I}` in LaTeX).
Note that \(\text{E}{\left[\ell'\right]}\) and \(\text{E}{\left[\ell'{\ell'}^{\top}\right]}\) are functions of \(\theta\) but not of \(x\); the expectation operator removed \(x\).
Also note that for most of the distributions you are familiar with (including Gaussian, binomial, Poisson, exponential):

\[\text{E}{\left[-\ell''\right]} = \text{E}{\left[\ell'{\ell'}^{\top}\right]}\]
Theorem 7 (Elements of the Hessian matrix) If \(\tilde{\theta}\) is a \(p\times 1\) vector, then the Hessian is a \(p \times p\) matrix, whose \(ij^{th}\) entry is:

\[\left(\ell''(\tilde{\theta})\right)_{ij} = \frac{\partial^2}{\partial \theta_i \, \partial \theta_j} \ell(\tilde{\theta})\]
Sometimes, we use \(I(\theta;x) \stackrel{\text{def}}{=}-\ell''(\theta; x)\) (note the standard-font “I” here). \(I(\theta;x)\) is the observed information, precision, or concentration matrix (the negative Hessian).
Key point
The asymptotics of MLEs give us \({\widehat{\theta}}_{\text{ML}} \sim N\left( \theta,\mathcal{I}^{- 1}(\theta) \right)\), approximately, for large sample sizes.
We can estimate \(\mathcal{I}^{- 1}(\theta)\) by working out \(\text{E}{\left[-\ell''\right]}\) or \(\text{E}{\left[\ell'{\ell'}^{\top}\right]}\) and plugging in \(\hat\theta_{\text{ML}}\), but sometimes we instead use \(I(\hat\theta_{\text{ML}},\tilde{x})\) for convenience; in some cases the latter is provably better according to certain criteria (Efron and Hinkley 1978).
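As a concrete illustration (my own sketch, anticipating the Poisson cyclone example in the next section): for an iid Poisson sample, \(\ell''(\lambda) = -\sum_i x_i / \lambda^2\), so the observed information is \(\sum_i x_i/\lambda^2\) while the expected information is \(n/\lambda\); at \(\hat\lambda_{\text{ML}} = \bar{x}\), the two coincide.

```r
# Compare observed and expected information at the Poisson MLE
# (cyclone counts from Table 1 below, typed in by hand):
x <- c(6, 5, 4, 6, 6, 3, 12, 7, 4, 2, 6, 7, 4)
lambda_hat <- mean(x)              # Poisson MLE
obs_info <- sum(x) / lambda_hat^2  # observed information I(lambda-hat; x)
exp_info <- length(x) / lambda_hat # expected information at lambda-hat
all.equal(obs_info, exp_info)      # TRUE: they coincide at the MLE
```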
Quantifying (un)certainty of MLEs
Confidence intervals for MLEs
An asymptotic approximation of a 95% confidence interval for \(\theta_k\) is

\[\hat\theta_k \pm z_{0.975} \, \widehat{\text{SE}}\left(\hat\theta_k\right)\]

where \(z_{0.975} = \Phi^{-1}(0.975) \approx 1.96\) is the 97.5th percentile of the standard Gaussian distribution.
Where \(m\) is the sample size of the new data to be predicted (typically 1, except for binary outcomes, where it needs to be bigger for prediction intervals to make sense).
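For instance (a sketch of my own, using the Poisson model from the next section, where the expected information for the rate is \(n/\lambda\)), a 95% Wald interval can be computed as:

```r
# 95% Wald confidence interval for a Poisson rate:
# SE(lambda-hat) is approximately sqrt(lambda-hat / n)
x <- c(6, 5, 4, 6, 6, 3, 12, 7, 4, 2, 6, 7, 4)  # cyclone counts (Table 1)
lambda_hat <- mean(x)
se_hat <- sqrt(lambda_hat / length(x))
ci <- lambda_hat + c(-1, 1) * qnorm(0.975) * se_hat
ci  # interval centered at lambda-hat
```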
1.2 Example: Maximum likelihood for Tropical Cyclones in Australia
The cyclones dataset in the dobson package (Table 1) records the number of tropical cyclones in Northeastern Australia during 13 November-to-April cyclone seasons (more details in Dobson and Barnett (2018) §1.6.5 and help(cyclones, package = "dobson")). Figure 1 graphs the number of cyclones (y-axis) by season (x-axis). Let’s use \(Y_i\) to represent these counts, where \(i\) is an indexing variable for the seasons and \(Y_i\) is the number of cyclones in season \(i\).
Exploratory analysis
Suppose we want to learn about how many cyclones to expect per season.
Table 1: Number of tropical cyclones during a season from November to April in Northeastern Australia
| season | years   | number |
|-------:|:--------|-------:|
|      1 | 1956/7  |      6 |
|      2 | 1957/8  |      5 |
|      3 | 1958/9  |      4 |
|      4 | 1959/60 |      6 |
|      5 | 1960/1  |      6 |
|      6 | 1961/2  |      3 |
|      7 | 1962/3  |     12 |
|      8 | 1963/4  |      7 |
|      9 | 1964/5  |      4 |
|     10 | 1965/6  |      2 |
|     11 | 1966/7  |      6 |
|     12 | 1967/8  |      7 |
|     13 | 1968/9  |      4 |
[R code]
library(ggplot2)
library(dplyr)
cyclones |>
  mutate(years = factor(years, levels = years)) |>
  ggplot(aes(x = years, y = number, group = 1)) +
  geom_point() +
  geom_line() +
  xlab("Season") +
  ylab("Number of cyclones") +
  expand_limits(y = 0) +
  theme(axis.text.x = element_text(vjust = .5, angle = 45))
Figure 1: Number of tropical cyclones per season in northeastern Australia, 1956-1969
There’s no obvious correlation between adjacent seasons, so let’s assume that each season is independent of the others.
Let’s also assume that they are identically distributed; let’s denote this distribution as \(P(Y=y)\). Note that there’s no index \(i\) in this expression, since we are assuming the \(Y_i\)s are identically distributed.
We can visualize the distribution using a bar plot (Figure 2).
[R code]
cyclones |>
  ggplot() +
  geom_histogram(aes(x = number)) +
  expand_limits(x = 0) +
  xlab("Number of cyclones") +
  ylab("Count (number of seasons)")
We want to estimate \(P(Y=y)\); that is, \(P(Y=y)\) is our estimand.
We could estimate \(P(Y=y)\) for each value of \(y\) in \(0:\infty\) separately (“nonparametrically”) using the fraction of our data with \(Y_i=y\), but then we would be estimating an infinitely large set of parameters, and we would have low precision. We will probably do better with a parametric model.
Exercise 7 What parametric probability distribution family might we use to model this empirical distribution?
Solution. Let’s use the Poisson. The Poisson distribution is appropriate for these data, because the observations are counts that could theoretically take any value (discrete) in the range \(0:\infty\). Visually, the plot of our data closely resembles a Poisson or binomial distribution. Since cyclones do not have an “upper limit” on the number of events we could potentially observe in one season, the Poisson distribution is more appropriate than the binomial.
Exercise 8 Write down the Poisson distribution’s probability mass function.
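A solution, for reference: if \(Y \sim \text{Pois}(\lambda)\), then

```latex
\text{P}(Y = y) = \frac{\lambda^y e^{-\lambda}}{y!}, \quad y \in \{0, 1, 2, \ldots\}, \quad \lambda > 0
```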
Estimating the model parameters using maximum likelihood
Now, we can estimate the parameter \(\lambda\) for this distribution using maximum likelihood estimation.
Exercise 9 (What is the likelihood?) Write down the likelihood (probability mass function or probability density function) of a single observation \(x\), according to your model.
Exercise 10 Write down the vector of parameters in your model.
Solution. There is only one parameter, \(\lambda\):
\[\theta = (\lambda)\]
Exercise 11 Write down the population mean and variance of a single observation from your chosen probability model, as a function of the parameters (extra credit - derive them).
Solution.
Population mean: \(\text{E}[X]=\lambda\)
Population variance: \(\text{Var}(X)=\lambda\)
Exercise 12 Write down the likelihood of the full dataset.
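A sketch of one solution, using the factorization for iid data (Theorem 2):

```latex
\begin{aligned}
\mathscr{L}(\lambda; \tilde{x}) &= \prod_{i=1}^n \frac{\lambda^{x_i} e^{-\lambda}}{x_i!}
= \frac{\lambda^{\sum_{i=1}^n x_i}\ e^{-n\lambda}}{\prod_{i=1}^n x_i!}
\\ \ell(\lambda; \tilde{x}) &= \left(\sum_{i=1}^n x_i\right) \log{\lambda} - n\lambda - \sum_{i=1}^n \log{(x_i!)}
\end{aligned}
```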
Exercise 23 Draw conclusions about the MLE of \(\lambda\).
Solution. Since \(\ell''(\tilde \lambda; \tilde{x})<0\), \(\tilde \lambda\) is at least a local maximizer of the likelihood function \(\mathscr{L}(\lambda)\). Since there is only one solution to the estimating equation and the Hessian is negative definite everywhere, \(\tilde \lambda\) must also be the global maximizer of \(\mathscr{L}(\lambda; \tilde{x})\):
[R code]
mle <-mean(cyclones$number)
\[\hat{\lambda}_{\text{ML}} = \bar x = 5.538462\]
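We can double-check this closed-form answer numerically (my own check, not from the text): the numerical maximizer of the Poisson log-likelihood should agree with \(\bar{x}\).

```r
# Numerically maximize the Poisson log-likelihood for the cyclone counts:
x <- c(6, 5, 4, 6, 6, 3, 12, 7, 4, 2, 6, 7, 4)
loglik_pois <- function(lambda) sum(dpois(x, lambda, log = TRUE))
opt <- optimize(loglik_pois, interval = c(0.01, 20), maximum = TRUE)
opt$maximum  # approximately 5.538462, matching mean(x)
```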
Exercise 24 Graph the log-likelihood with the MLE superimposed.
Solution.
[R code]
library(dplyr)
mle_data <- tibble(x = mle, y = loglik(mle))
ll_plot +
  geom_point(data = mle_data, aes(x = x, y = y), col = "red")
Figure 7: log-likelihood of Dobson cyclone data with MLE
Information matrices
[R code]
obs_inf <- function(...) -hessian(...)
ggplot() +
  geom_function(fun = obs_inf, n = 1001) +
  xlim(min(cyclones$number), max(cyclones$number)) +
  ylab("I(lambda)") +
  xlab("lambda") +
  geom_hline(yintercept = 0, col = "red")
Figure 8: Observed information function of Dobson cyclone data
1.3 Finding the MLE using the Newton-Raphson algorithm
For computational simplicity, we will sometimes use \(\mathcal{I}^{- 1}(\theta)\) in place of \(I\left( \widehat{\theta},y \right)\); doing so is called “Fisher scoring” or the “method of scoring”. Note: this substitution is the opposite of the one we made when estimating the variance of the MLE; here we should technically use the observed information, but we use the expected information instead.
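A minimal sketch of the update for the Poisson cyclone model (my own illustration; for this model the score is \(\sum_i x_i/\lambda - n\), the observed information is \(\sum_i x_i/\lambda^2\), and the expected information is \(n/\lambda\)):

```r
# Newton-Raphson iteration for the Poisson rate parameter:
x <- c(6, 5, 4, 6, 6, 3, 12, 7, 4, 2, 6, 7, 4)  # cyclone counts
n <- length(x)
lambda <- 3  # starting guess
for (iter in 1:20) {
  score <- sum(x) / lambda - n
  info <- sum(x) / lambda^2  # observed information;
  # Fisher scoring would use the expected information n / lambda instead
  lambda <- lambda + score / info
}
lambda  # converges to the MLE, mean(x) = 5.538462
```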
There’s also an “empirical information matrix” (see McLachlan and Krishnan (2007)):
Figure 13: Newton-Raphson algorithm for finding MLE of model 13 for cyclone data
1.4 Maximum likelihood inference for univariate Gaussian models
Suppose \(X_{1}, ..., X_{n} \ \sim_{\text{iid}}\ N(\mu, \sigma^{2})\). Let \(X = (X_{1},\ldots,X_{n})^{\top}\) be these random variables in vector format. Let \(x_{i}\) and \(x\) denote the corresponding observed data. Then \(\theta = (\mu,\sigma^{2})\) is the vector of true parameters, and \(\Theta = (\text{M}, \Sigma^2)\) is the vector of parameters as a random vector.
When solving for \({\widehat{\sigma}}_{ML}\), you can treat \(\sigma^{2}\) as an atomic variable (don’t differentiate with respect to \(\sigma\) or things get messy). In fact, you can replace \(\sigma^{2}\) with \(1/\tau\) and differentiate with respect to \(\tau\) instead, and the process might be even easier.
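A sketch of the resulting derivation, using the \(\tau = 1/\sigma^2\) substitution suggested above:

```latex
\begin{aligned}
\ell(\mu, \tau) &= \frac{n}{2}\log{\tau} - \frac{\tau}{2}\sum_{i=1}^n (x_i - \mu)^2 - \frac{n}{2}\log{(2\pi)}
\\ \frac{\partial \ell}{\partial \mu} &= \tau \sum_{i=1}^n (x_i - \mu) = 0
\implies \hat{\mu} = \bar{x}
\\ \frac{\partial \ell}{\partial \tau} &= \frac{n}{2\tau} - \frac{1}{2}\sum_{i=1}^n (x_i - \mu)^2 = 0
\implies \hat{\sigma}^2 = \frac{1}{\hat{\tau}} = \frac{1}{n}\sum_{i=1}^n (x_i - \bar{x})^2
\end{aligned}
```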
See Casella and Berger (2002) p322, example 7.2.12.
To prove it’s a maximum, we need:
\(\ell' = 0\)
At least one diagonal element of \(\ell''\) is negative.
Determinant of \(\ell''\) is positive.
1.5 Example: hormone therapy study
[R code]
# load the data directly from a UCSF website
hers <- haven::read_dta(
  paste0(
    # I'm breaking up the url into two chunks for readability
    "https://regression.ucsf.edu/sites/g/files",
    "/tkssra6706/f/wysiwyg/home/data/hersdata.dta"
  )
)
[R code]
hers |> head()
Table 4: HERS dataset
[R code]
n_obs <- 100
# we're going to take a small subset of the data to look at;
# if we took the whole data set, the likelihood function would be hard to
# graph nicely
library(dplyr)
data1 <- hers |>
  filter(
    diabetes == 0,
    exercise == 0
  ) |>
  head(n_obs)
glucose_data <- data1 |> pull(glucose)
library(ggplot2)
plot1 <- data1 |>
  ggplot(aes(x = glucose)) +
  geom_histogram(aes(x = glucose, after_stat(density))) +
  theme_classic()
print(plot1)
Looks somewhat plausibly Gaussian. Good enough for this example!
Looks like a somewhat decent fit? We could probably do better, but that’s for another time.
Construct the likelihood and log-likelihood functions
[R code]
loglik <- function(
    mu,  # I'm assigning default values, which the function will use
    # unless we tell it otherwise
    sigma = sd(x),  # note that you can define some default inputs
    # based on other arguments
    x = glucose_data,
    n = length(x)) {
  normalizing_constants <- -n / 2 * log((sigma^2) * 2 * pi)
  likelihood_kernel <- -1 / (2 * sigma^2) * {
    # I have to do this part in a somewhat complicated way
    # so that we can pass in vectors of possible values of mu
    # and get the likelihood for each value;
    # for the binomial case it's easier
    sum(x^2) - 2 * sum(x) * mu + n * mu^2
  }
  answer <- normalizing_constants + likelihood_kernel
  return(answer)
}
# `...` means pass any inputs to lik() along to loglik()
lik <- function(...) exp(loglik(...))
Graph the Likelihood as a function of \(\mu\)
(fixing \(\sigma^2\) at \(\hat\sigma^2 = 104.7444\))
[R code]
se_mu_hat <- function(n, sigma = sd(glucose_data)) sigma / sqrt(n)
ggplot() +
  geom_function(fun = se_mu_hat) +
  scale_x_log10(
    limits = c(10, 10^5),
    name = "Sample size",
    labels = scales::label_comma()
  ) +
  ylab("Standard error of mu (mg/dl)") +
  theme_classic()
Power
Rejection region
For example, suppose we wish to detect a difference from the hypothesized value \(\mu_0 = 95\). We reject the null hypothesis for any mean value outside the “non-rejection interval” \[ \mu_0 \pm F^{-1}_{t(n-1)}(1-\alpha/2) \sqrt{\frac{\sigma^2}{n}} \]
In this case, the non-rejection interval is \([92.959028, 97.040972]\).
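The endpoints can be recomputed along these lines (a sketch; I'm assuming \(\hat\sigma^2 = 104.7444\) and \(n = 100\) from the surrounding example, so the exact values obtained depend on which variance estimate and degrees of freedom are used):

```r
# Non-rejection interval around the hypothesized mean:
mu_0 <- 95
n <- 100
sigma2 <- 104.7444  # estimated variance from the glucose data above
crit <- qt(0.975, df = n - 1)
interval <- mu_0 + c(-1, 1) * crit * sqrt(sigma2 / n)
interval  # symmetric around mu_0
```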
Calculate power under a simple alternative
Consider the simple alternative that the true value is actually the estimated mean calculated from the data (i.e. \(98.66\)). Let’s also assume that the known standard deviation is what we estimated from the data.
[R code]
power <- function(n = 100, null = 95, alt = 98.66) {
  # there's no such thing as fractional sample size:
  n <- floor(n)
  # using the function we wrote earlier:
  se <- se_mu_hat(n = n)
  reject_upper <- ((null + qt(0.975, df = n - 1) * se) - alt) / se
  reject_lower <- ((null - qt(0.975, df = n - 1) * se) - alt) / se
  # probability of rejecting below the lower cutoff:
  p_reject_low <- pt(q = reject_lower, df = n - 1)
  # probability of rejecting above the upper cutoff:
  p_reject_high <- pt(q = reject_upper, df = n - 1, lower.tail = FALSE)
  p_reject <- p_reject_low + p_reject_high
  return(p_reject)
}
power_plot <- ggplot() +
  geom_function(fun = power, n = 100) +
  xlim(c(2, 200)) +  # n = 1 is not allowed for t-distribution
  ylim(0, 1) +
  ylab("Power") +
  xlab("n") +
  theme_bw()
print(power_plot)
Simulations
Create simulation framework
Here’s a function that performs a single simulation of a Gaussian modeling analysis:
[R code]
do_one_sim <- function(
    n = 100,
    mu = mean(glucose_data),
    mu_0 = mean(glucose_data) * 0.9,
    sigma2 = var(glucose_data),
    return_data = FALSE  # if this is set to TRUE, we will return a list()
    # containing both the analytic results and the vector of simulated data
) {
  # generate data
  x <- rnorm(n = n, mean = mu, sd = sqrt(sigma2))
  # analyze data
  mu_hat <- mean(x)
  sigmahat <- sd(x)
  se_hat <- sigmahat / sqrt(n)
  confint <- mu_hat + c(-1, 1) * se_hat * qt(.975, df = n - 1)
  tstat <- abs(mu_hat - mu_0) / se_hat
  pval <- pt(df = n - 1, q = tstat, lower.tail = FALSE) * 2
  confint_covers <- between(mu, confint[1], confint[2])
  test_rejects <- pval < 0.05
  # if you want spaces, hyphens, or special characters in your column names,
  # use "", '', or ``:
  to_return <- tibble(
    "mu-hat" = mu_hat,
    "sigma-hat" = sigmahat,
    "se_hat" = se_hat,
    "confint_left" = confint[1],
    "confint_right" = confint[2],
    "tstat" = tstat,
    "pval" = pval,
    "confint covers true mu" = confint_covers,
    "test rejects null hypothesis" = test_rejects
  )
  if (return_data) {
    return(list(data = x, results = to_return))
  } else {
    return(to_return)
  }
}
Let’s see what this function outputs for us:
[R code]
do_one_sim()
Looks good!
Now let’s check it against the t.test() function from the stats package:
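The comparison code itself isn't shown above; here's one way it might look (a sketch with simulated inputs of my own choosing): the manually computed t-statistic and p-value should match `t.test()`'s output.

```r
set.seed(1)
x <- rnorm(n = 100, mean = 98.66, sd = 10)  # simulated glucose-like data
mu_0 <- 98.66 * 0.9
# manual calculation, as in do_one_sim():
se_hat <- sd(x) / sqrt(length(x))
tstat <- abs(mean(x) - mu_0) / se_hat
pval <- 2 * pt(tstat, df = length(x) - 1, lower.tail = FALSE)
# compare with the built-in implementation:
tt <- t.test(x, mu = mu_0)
all.equal(tstat, unname(abs(tt$statistic)))  # TRUE
all.equal(pval, tt$p.value)                  # TRUE
```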
Here’s a function that calls the previous function n_sims times and summarizes the results:
[R code]
do_n_sims <- function(
    n_sims = 1000,
    ...  # this symbol means "allow additional arguments to be passed on to
    # the `do_one_sim` function"
) {
  sim_results <- NULL
  # we're going to create a "tibble" of results,
  # row by row (slightly different from the hint on the homework)
  for (i in 1:n_sims) {
    set.seed(i)  # sets a different seed for each simulation iteration,
    # to get a different dataset each time
    current_results <-
      do_one_sim(...) |>  # here's where the simulation actually gets run
      mutate(sim_number = i) |>
      relocate("sim_number", .before = everything())
    sim_results <- sim_results |> bind_rows(current_results)
  }
  return(sim_results)
}
[R code]
sim_results <- do_n_sims(
  n_sims = 1000,
  mu = mean(glucose_data),
  sigma2 = var(glucose_data),
  n = 100  # this is the number of samples per simulated data set
)
sim_results
The simulation results are in! Now we have to analyze them.
Analyze simulation results
To do that, we write another function:
[R code]
summarize_sim <- function(
    sim_results,
    mu = mean(glucose_data),
    sigma2 = var(glucose_data),
    n = 100) {
  # calculate the true standard error based on the data-generating parameters:
  se_mu_hat <- sqrt(sigma2 / n)
  sim_results |>
    summarize(
      `bias[mu-hat]` = mean(.data$`mu-hat`) - mu,
      `SE(mu-hat)` = sd(.data$`mu-hat`),
      `bias[SE-hat]` = mean(.data$se_hat) - se_mu_hat,
      `SE(SE-hat)` = sd(.data$se_hat),
      coverage = mean(.data$`confint covers true mu`),
      power = mean(.data$`test rejects null hypothesis`)
    )
}
Let’s try it out:
[R code]
sim_summary <- summarize_sim(
  sim_results,
  mu = mean(glucose_data),  # this function needs to know the true parameter
  # values in order to assess bias
  sigma2 = var(glucose_data),
  n = 100
)
sim_summary
From this simulation, we observe that our estimate of \(\mu\), \(\hat\mu\), has minimal bias, and so does our estimate of \(SE(\hat\mu)\), \(\hat{SE}(\hat\mu)\).
The confidence intervals captured the true value even more often than they were supposed to, and the hypothesis test always rejected the null hypothesis.
I wonder what would happen with a different sample size, a different true \(\mu\) value, or a different \(\sigma^2\) value…
[R code]
## Graph the log-likelihood ranging over both parameters at once:
library(plotly)
n_points <- 25
mu <- seq(90, 105, length.out = n_points)
sigma <- seq(6, 20, length.out = n_points)
names(mu) <- round(mu, 5)
names(sigma) <- round(sigma, 5)
lliks <- outer(mu, sigma, loglik)
liks <- outer(mu, sigma, lik)
plotly::plot_ly(
  type = "surface",
  x = ~mu,
  y = ~sigma,
  z = ~t(lliks)
)
# see https://stackoverflow.com/questions/69472185/correct-use-of-coordinates-to-plot-surface-data-with-plotly
# for an explanation of why the transpose `t()` is needed
Figure 16: Log-likelihood of hers data w.r.t. \(\mu\) and \(\sigma^2\)
Casella, George, and Roger L. Berger. 2002. Statistical Inference. 2nd ed. Pacific Grove, CA: Duxbury.
Dobson, Annette J, and Adrian G Barnett. 2018. An Introduction to Generalized Linear Models. 4th ed. CRC Press. https://doi.org/10.1201/9781315182780.
Efron, Bradley, and David V Hinkley. 1978. “Assessing the Accuracy of the Maximum Likelihood Estimator: Observed Versus Expected Fisher Information.” Biometrika 65 (3): 457–83.
Hogg, Robert V., Elliot A. Tanis, and Dale L. Zimmerman. 2015. Probability and Statistical Inference. Ninth edition. Boston: Pearson.
Hulley, Stephen, Deborah Grady, Trudy Bush, Curt Furberg, David Herrington, Betty Riggs, Eric Vittinghoff, for the Heart and Estrogen/progestin Replacement Study (HERS) Research Group. 1998. “Randomized Trial of Estrogen Plus Progestin for Secondary Prevention of Coronary Heart Disease in Postmenopausal Women.” JAMA: The Journal of the American Medical Association 280 (7): 605–13.
Lehmann, E. L. 1999. Elements of Large-Sample Theory. Springer Texts in Statistics. New York: Springer. https://doi.org/10.1007/b98855.
McLachlan, Geoffrey J, and Thriyambakam Krishnan. 2007. The EM Algorithm and Extensions. 2nd ed. John Wiley & Sons. https://doi.org/10.1002/9780470191613.
Newey, Whitney K, and Daniel McFadden. 1994. “Large Sample Estimation and Hypothesis Testing.” In Handbook of Econometrics, edited by Robert Engle and Dan McFadden, 4:2111–2245. Elsevier. https://doi.org/10.1016/S1573-4412(05)80005-4.